Efficient Exploration for Optimizing Immediate Reward

Author

Dale Schuurmans

Abstract

We consider the problem of learning an effective behavior strategy from reward. Although much studied, the issue of how to use prior knowledge to scale optimal behavior learning up to real-world problems remains an important open issue. We investigate the inherent data-complexity of behavior learning when the goal is simply to optimize immediate reward. Although easier than reinforcement learning, where one must also cope with state dynamics, immediate reward learning is still a common problem and is fundamentally harder than supervised learning. For optimizing immediate reward, prior knowledge can be expressed either as a bias on the space of possible reward models, or a bias on the space of possible controllers. We investigate the two paradigmatic learning approaches of indirect (reward-model) learning and direct-control learning, and show that neither uniformly dominates the other in general. Model-based learning has the advantage of generalizing reward experiences across states and actions, but direct-control learning has the advantage of focusing only on potentially optimal actions and avoiding learning irrelevant world details. Both strategies can be strongly advantageous in different circumstances. We introduce hybrid learning strategies that combine the benefits of both approaches, and uniformly improve their learning efficiency.
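The contrast between indirect (reward-model) learning and direct-control learning can be illustrated with a toy sketch. This is not the paper's algorithm: the reward table `TRUE_REWARD`, the uniform exploration in `collect`, and the candidate controller set are all hypothetical choices made only for illustration. The indirect learner estimates a reward for every state-action pair and then acts greedily, while the direct learner only scores the controllers in its restricted candidate class.

```python
from collections import defaultdict
from itertools import product

STATES = [0, 1]
ACTIONS = [0, 1, 2]

# Hypothetical deterministic reward table, for illustration only.
TRUE_REWARD = {
    (0, 0): 0.2, (0, 1): 0.9, (0, 2): 0.1,
    (1, 0): 0.7, (1, 1): 0.3, (1, 2): 0.4,
}

def collect(n_rounds):
    """Try every action in every state n_rounds times (uniform exploration)."""
    return [(s, a, TRUE_REWARD[(s, a)])
            for _ in range(n_rounds) for s, a in product(STATES, ACTIONS)]

def indirect_controller(samples):
    """Indirect learning: fit a reward model, then act greedily on it."""
    total, count = defaultdict(float), defaultdict(int)
    for s, a, r in samples:
        total[(s, a)] += r
        count[(s, a)] += 1
    model = {k: total[k] / count[k] for k in total}
    return {s: max(ACTIONS, key=lambda a: model.get((s, a), 0.0)) for s in STATES}

def direct_controller(samples, candidates):
    """Direct learning: score each candidate controller on the samples it agrees with."""
    def score(ctrl):
        hits = [r for s, a, r in samples if ctrl[s] == a]
        return sum(hits) / len(hits) if hits else 0.0
    return max(candidates, key=score)

samples = collect(3)
# Assumed prior knowledge: only these three controllers are considered.
candidates = [{0: 1, 1: 0}, {0: 0, 1: 2}, {0: 2, 1: 1}]
model_based = indirect_controller(samples)
direct = direct_controller(samples, candidates)
print(model_based, direct)
```

In this toy setting both learners recover the same controller. The difference the abstract points to is in what each must estimate: the indirect learner needs a reward estimate for every state-action pair, whereas the direct learner only compares the candidates its bias allows.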

Similar resources

Efficient exploration for optimizing immediate reward

We consider the problem of learning an effective behavior strategy from reward. Although much studied, the issue of how to use prior knowledge to scale optimal behavior learning up to real-world problems remains an important open issue. We investigate the inherent data-complexity of behavior-learning when the goal is simply to optimize immediate reward. Although easier than reinforcement learni...

Boredom, Information-Seeking and Exploration

Any adaptive organism faces the choice between taking actions with known benefits (exploitation), and sampling new actions to check for other, more valuable opportunities available (exploration). The latter involves information-seeking, a drive so fundamental to learning and long-term reward that it can reasonably be considered, through evolution or development, to have acquired its own value, i...

Optimal Exploration-Exploitation in a Multi-Armed-Bandit Problem with Non-stationary Rewards

In a multi-armed bandit (MAB) problem a gambler needs to choose at each round of play one of K arms, each characterized by an unknown reward distribution. Reward realizations are only observed when an arm is selected, and the gambler’s objective is to maximize his cumulative expected earnings over some given horizon of play T. To do this, the gambler needs to acquire information about arms (ex...

Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards

Information-Seeking, Learning and the Marginal Value Theorem: A Normative Approach to Adaptive Exploration

Daily life often makes us decide between two goals: maximizing immediate rewards (exploitation) and learning about the environment so as to improve our options for future rewards (exploration). An adaptive organism therefore should place value on information independent of immediate reward, and affective states may signal such value (e.g., curiosity vs. boredom: Hill & Perkins, 1985; Eastwood e...

Publication date: 1999
